Cost Functions
- A cost function determines the "cost" (or penalty) of estimating $\hat{s}$ when the true or correct quantity is really $s$.
- This is essentially the cost of the error between the true stimulus value $s$ and our estimate $\hat{s}$:
  $$\text{cost} = C(\hat{s}, s)$$
  where $C$ is a function of the error $\hat{s} - s$.
Cost vs. Loss:
loss applies to a single training sample; cost is the average of the loss over all training samples.
Forms of cost functions
Note that the error can be defined in different ways:
- Find more types of error in Error Metrics.
- In ML, Mean Squared Error is commonly used as the cost function, but with an extra division by 2, which "is just meant to make later partial derivation in gradient descent neater" (see the sketch below):
  $$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$
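A minimal NumPy sketch of this cost (the function name `mse_cost` and the toy vectors are made up for this note): each sample contributes a squared-error loss, and the cost averages them with the extra factor of 1/2.

```python
import numpy as np

def mse_cost(y_true, y_pred):
    """Mean squared error cost with the conventional 1/(2m) factor.

    Per-sample loss: (y_pred_i - y_true_i)**2 / 2
    Cost: mean of the per-sample losses over all m samples.
    """
    m = len(y_true)
    errors = y_pred - y_true                 # per-sample error
    return np.sum(errors ** 2) / (2 * m)     # the 1/2 cancels when differentiating

# Example: cost of a small prediction vector
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])
print(mse_cost(y_true, y_pred))              # ~0.208
```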
Cost function with regularization
When regularization is used, a regularization term is added to the cost function in order to penalize large weights and avoid overfitting.
- Example - linear regression:
  $$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
- Example - logistic regression:
  $$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
  where $\lambda$ is the regularization parameter (the bias term $\theta_0$ is conventionally not regularized).
- Different types of regularization terms can be added (see the sketch after this list):
    - L1 regularization = train to minimize normal loss + c · L1(weights)
        - L1: sum of the absolute values of the weights; like lasso regression
        - Drives some weights to 0
        - Good for models with fewer features, each of which has a large or medium effect
    - L2 regularization = train to minimize normal loss + c · L2(weights)
        - L2: sum of the squares of the weights; like ridge regression
        - Makes the biggest weights smaller
        - Heavily punishes "outliers", i.e. the very large parameters
        - Good for models with many features, each of which has a small effect
    - Train to minimize normal loss, but don't let the weights get too big
        - Like an L-infinity penalty
- In practice, L1 regularization produces a sparse model, i.e. a model that has most of its parameters equal to zero, provided the hyperparameter c is large enough. So L1 performs feature selection by deciding which features are essential for prediction and which are not, which can be useful when you want to increase model explainability.
- However, if your only goal is to maximize the performance of the model on the holdout data, then L2 usually gives better results. L2 also has the advantage of being differentiable, so gradient descent can be used for optimizing the objective function.
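A minimal sketch of adding an L1 or L2 penalty to an already-computed data loss (NumPy assumed; the function name `regularized_cost` and the toy weights are illustrative, not from any particular library):

```python
import numpy as np

def regularized_cost(base_loss, weights, c, kind="l2"):
    """Add an L1 or L2 penalty to an already-computed data loss.

    base_loss : scalar loss on the training data (e.g. MSE or cross-entropy)
    weights   : model weights (bias excluded, as is common practice)
    c         : regularization strength
    """
    if kind == "l1":
        penalty = np.sum(np.abs(weights))   # lasso-like: drives some weights to 0
    elif kind == "l2":
        penalty = np.sum(weights ** 2)      # ridge-like: shrinks the largest weights
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return base_loss + c * penalty

w = np.array([0.0, 2.5, -0.1, 4.0])
print(regularized_cost(1.0, w, c=0.01, kind="l1"))  # 1.0 + 0.01 * 6.6
print(regularized_cost(1.0, w, c=0.01, kind="l2"))  # 1.0 + 0.01 * 22.26
```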
Loss and cost for different functions
Loss and cost for linear regression -> Analytic solution
In matrix form the cost is $J(\theta) = \frac{1}{2m}\lVert X\theta - y\rVert^2$, with $X$ the design matrix (one row per sample) and $y$ the vector of targets. Setting the gradient to zero gives the normal equation:
$$\theta = \left(X^{T}X\right)^{-1}X^{T}y$$
The solution will only be unique when the matrix $X^{T}X$ is invertible, i.e. when $X$ has full column rank.
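A small NumPy sketch of solving the normal equation on synthetic data (the variable names and the data are made up for illustration; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Hypothetical data: m samples, n features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

# Normal equation: theta = (X^T X)^{-1} X^T y
# Unique only when X^T X is invertible (X has full column rank)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)   # close to [1.0, -2.0, 0.5]
```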
Loss and cost for logistic regression
MSE is not appropriate here because the resulting cost function would not be convex (it would have many local minima). Instead, the loss function for a single sample is:
$$L\big(h_\theta(x), y\big) = \begin{cases} -\log\big(h_\theta(x)\big), & \text{if } y = 1 \\ -\log\big(1 - h_\theta(x)\big), & \text{if } y = 0 \end{cases}$$
Combining the two cases above, we get the simplified loss function for logistic regression:
$$L\big(h_\theta(x), y\big) = -y\log\big(h_\theta(x)\big) - (1-y)\log\big(1-h_\theta(x)\big)$$
then the cost function in full form (also used in Maximum likelihood estimation for logistic regression):
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]$$
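A minimal NumPy sketch of this cost (the function name `logistic_cost`, the clipping constant `eps`, and the toy data are assumptions for illustration):

```python
import numpy as np

def logistic_cost(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy cost, averaged over the m training samples.

    y_prob : predicted probabilities h_theta(x), clipped away from 0 and 1
             so the logs stay finite.
    """
    p = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return losses.mean()

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.6])
print(logistic_cost(y_true, y_prob))   # ~0.30
```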
Loss and cost for Softmax
What is Softmax: Artificial Neural Networks#^6ef895
The loss function associated with Softmax, the cross-entropy loss, is:
$$L(\mathbf{a}, y) = -\log a_y = \begin{cases} -\log a_1, & \text{if } y = 1 \\ \quad\vdots \\ -\log a_N, & \text{if } y = N \end{cases}$$
where $a_n$ is the Softmax output (predicted probability) for class $n$.
Only the line that corresponds to the target contributes to the loss, other lines are zero:
$$\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if } y == n \\ 0, & \text{otherwise}. \end{cases}$$
Cost function:
$$J(\mathbf{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{n=1}^{N}\mathbf{1}\{y^{(i)} == n\}\log a_n^{(i)}$$
where $m$ is the number of training examples, $N$ the number of classes, and $a_n^{(i)}$ the Softmax probability assigned to class $n$ for example $i$.
Cross-entropy takes the full distribution into account.
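A minimal NumPy sketch of the Softmax cross-entropy cost, showing that only the target class's probability enters each per-example loss (the function names and toy logits are made up for this note):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_cost(logits, targets):
    """Softmax cross-entropy averaged over m examples.

    logits  : (m, N) raw scores
    targets : (m,) integer class labels in [0, N)
    """
    probs = softmax(logits)
    m = len(targets)
    target_probs = probs[np.arange(m), targets]   # a_y for each example
    return -np.log(target_probs).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
targets = np.array([0, 1])
print(cross_entropy_cost(logits, targets))   # ~0.18
```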
Expected loss function
A posterior distribution tells us about the confidence or credibility we assign to different choices. A cost function describes the penalty we incur when choosing an incorrect option. These concepts can be combined into an expected loss function.
Expected loss is defined as:
$$\mathbb{E}[\text{Loss} \mid \hat{s}] = \int C(\hat{s}, s)\, p(s \mid \text{data})\, ds$$
where $p(s \mid \text{data})$ is the posterior over the true value $s$ and $C(\hat{s}, s)$ is the cost of choosing the estimate $\hat{s}$ when the truth is $s$. The best estimate minimizes the expected loss, and which estimate that is depends on the cost function:
- The posterior's mean minimizes the mean-squared error.
- The posterior's median minimizes the absolute error.
- The posterior's mode minimizes the zero-one loss.
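A small numerical sketch of the three facts above, using a made-up bimodal posterior on a discretized grid (NumPy assumed; the grid, the posterior shape, and the 0.05 tolerance used to approximate the zero-one loss are all illustrative choices):

```python
import numpy as np

# Made-up bimodal posterior over a discretized stimulus value s
s = np.linspace(0, 10, 1001)
posterior = np.exp(-0.5 * ((s - 3) / 1.0) ** 2) + 0.5 * np.exp(-0.5 * ((s - 7) / 0.5) ** 2)
posterior /= posterior.sum()          # normalize to a probability mass function

def expected_loss(cost):
    """Expected loss E[C(s_hat, s)] under the posterior, for every candidate s_hat."""
    return np.array([np.sum(cost(s_hat, s) * posterior) for s_hat in s])

mse      = expected_loss(lambda s_hat, x: (s_hat - x) ** 2)    # squared error
abs_err  = expected_loss(lambda s_hat, x: np.abs(s_hat - x))   # absolute error
zero_one = expected_loss(lambda s_hat, x: (np.abs(s_hat - x) > 0.05).astype(float))

print("minimizes MSE      :", s[np.argmin(mse)])       # ~ posterior mean
print("minimizes abs error:", s[np.argmin(abs_err)])   # ~ posterior median
print("minimizes zero-one :", s[np.argmin(zero_one)])  # ~ posterior mode
print("posterior mean     :", np.sum(s * posterior))
```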
Good practice in minimizing loss functions
- Be aware of the Clever Hans effect: the model learns to minimize the loss function, but it does not learn the thing it should learn (it exploits spurious shortcuts in the data instead)...